In [1]:
# Computations
import pandas as pd
import numpy as np
import calendar

# sklearn
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Pytorch
import torch
from torch.autograd import Variable
import torch.nn as nn 
import torchvision.transforms as transforms

# Visualisation libraries

## Progress Bar
import progressbar

## Text
from colorama import Fore, Back, Style
from IPython.display import Image, display, Markdown, Latex, clear_output

## plotly
from plotly.offline import init_notebook_mode, iplot 
import plotly.graph_objs as go
import plotly.offline as py
from plotly.subplots import make_subplots
import plotly.express as px

## seaborn
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("paper", rc={"font.size":12,"axes.titlesize":14,"axes.labelsize":12})

## matplotlib
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from matplotlib.patches import Ellipse, Polygon
import matplotlib.gridspec as gridspec
import matplotlib.colors
from pylab import rcParams
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (17, 6)
mpl.rcParams['axes.labelsize'] = 14
mpl.rcParams['xtick.labelsize'] = 12
mpl.rcParams['ytick.labelsize'] = 12
mpl.rcParams['text.color'] = 'k'
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
Bank Marketing Dataset

In this article, we work on a dataset available from the UCI Machine Learning Repository. The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

This dataset is based on the Bank Marketing dataset from the UC Irvine Machine Learning Repository. The data is enriched by the addition of five new social and economic features/attributes (national wide indicators from a ~10M population country), published by the Banco de Portugal and publicly available at bportugal.pt/estatisticasweb. This dataset is almost identical to the one used in [Moro et al., 2014] (it does not include all attributes due to privacy concerns).

Dataset Information:

The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed.

There are four datasets:

  1. bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
  2. bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
  3. bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
  4. bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with fewer inputs).

The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).

Loading the Dataset

The zip file includes two datasets:

  1. bank-additional-full.csv with all examples, ordered by date (from May 2008 to November 2010).
  2. bank-additional.csv with 10% of the examples (4119), randomly selected from bank-additional-full.csv. The smallest dataset is provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The binary classification goal is to predict if the client will subscribe to a bank term deposit (variable y).

In [2]:
Data = pd.read_csv('Data/Bank_mod.csv')
display(Data.head().round(2))
Age Job Marital Education Default Housing Loan Contact Month Day Of Week ... Campaign Pdays Previous Poutcome Employment Variation Rate Consumer Price Index Consumer Confidence Index Euribor three Month Rate Number of Employees Term Deposit Subscription
0 56 Housemaid Married Basic.4Y No No No Telephone May Monday ... 1 999 0 Nonexistent 1.1 93.99 -36.4 4.86 5191.0 No
1 57 Services Married High.School Unknown No No Telephone May Monday ... 1 999 0 Nonexistent 1.1 93.99 -36.4 4.86 5191.0 No
2 37 Services Married High.School No Yes No Telephone May Monday ... 1 999 0 Nonexistent 1.1 93.99 -36.4 4.86 5191.0 No
3 40 Admin. Married Basic.6Y No No No Telephone May Monday ... 1 999 0 Nonexistent 1.1 93.99 -36.4 4.86 5191.0 No
4 56 Services Married High.School No No Yes Telephone May Monday ... 1 999 0 Nonexistent 1.1 93.99 -36.4 4.86 5191.0 No

5 rows × 21 columns

Number of Instances Number of Attributes
41188 21

Bank Client Data

Feature Description
Age numeric
Job Type of Job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
Marital marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
Education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
Default has credit in default? (categorical: "no","yes","unknown")
Housing has housing loan? (categorical: "no","yes","unknown")
Loan has personal loan? (categorical: "no","yes","unknown")
Feature Description
Contact contact communication type (categorical: "cellular","telephone")
Month last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
Day of week last contact day of the week (categorical: "mon","tue","wed","thu","fri")
Duration last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other Attributes

Feature Description
Campaign number of contacts performed during this campaign and for this client (numeric, includes last contact)
Pdays number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
Previous number of contacts performed before this campaign and for this client (numeric)
Poutcome outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")

Social and Economic Context Attributes

Feature Description
Employment Variation Rate employment variation rate - quarterly indicator (numeric)
Consumer Price Index consumer price index - monthly indicator (numeric)
Consumer Confidence Index consumer confidence index - monthly indicator (numeric)
Euribor three Month Rate euribor* 3 month rate - daily indicator (numeric)
Number of Employees number of employees - quarterly indicator (numeric)

* the basic rate of interest used in lending between banks on the European Union interbank market and also used as a reference for setting the interest rate on other loans.

Output variable (Desired Target):

Feature Description
Term Deposit Subscription has the client Term Deposit Subscription? (binary: "yes","no")
In [3]:
Dataset_Subcategories = {}
Dataset_Subcategories['Bank Client Data'] = Data.iloc[:,:7].columns.tolist()
Dataset_Subcategories['Related with the Last Contact of the Current Campaign'] = Data.iloc[:,7:11].columns.tolist()
Dataset_Subcategories['Other Attributes'] = Data.iloc[:,11:15].columns.tolist()
Dataset_Subcategories['Social and Economic Context Attributes'] = Data.iloc[:,15:-1].columns.tolist()
Dataset_Subcategories['Output variable (Desired Target)'] = Data.iloc[:,-1:].columns.tolist()

Preparing the Dataset

In [4]:
def List_Print(Text, List):
    print(Back.BLACK + Fore.CYAN + Style.NORMAL + '%s:' % Text + Style.RESET_ALL + ' %s' % ', '.join(List))

Categorical Variables

In [5]:
def Data_Plot(Inp):
    data_info = Inp.dtypes.astype(str).to_frame(name='Data Type')
    Temp = Inp.isnull().sum().to_frame(name = 'Number of NaN Values')
    data_info = data_info.join(Temp, how='outer')
    data_info ['Size'] = Inp.shape[0]
    data_info['Percentage'] = 100 - np.round(100*(data_info['Number of NaN Values']/Inp.shape[0]),2)
    data_info.index.name = 'Features'
    data_info = data_info.reset_index(drop = False)
    #
    fig = px.bar(data_info, x= 'Features', y= 'Percentage', color = 'Data Type', text = 'Data Type',
                color_discrete_sequence = ['PaleGreen', 'LightCyan', 'PeachPuff', 'Pink', 'Plum'],
                 hover_data = data_info.columns)
    fig.update_layout(plot_bgcolor= 'white', legend=dict(x=1.01, y=.5, traceorder="normal",
                                                         bordercolor="DarkGray", borderwidth=1))
    fig.update_traces(texttemplate= 6*' ' + '%{label}', textposition='inside')
    fig.update_traces(marker_line_color= 'Black', marker_line_width=1., opacity=1)
    fig.show()
    
def dtypes_group(Inp):
    Temp = Inp.dtypes.to_frame(name='Data Type').sort_values(by=['Data Type'])
    Out = pd.DataFrame(index =Temp['Data Type'].unique(), columns = ['Features','Count'])
    for c in Temp['Data Type'].unique():
        Out.loc[Out.index == c, 'Features'] = [Temp.loc[Temp['Data Type'] == c].index.tolist()]
        Out.loc[Out.index == c, 'Count'] = len(Temp.loc[Temp['Data Type'] == c].index.tolist())
    Out.index.name = 'Data Type'
    Out = Out.reset_index(drop = False)
    Out['Data Type'] = Out['Data Type'].astype(str)
    return Out

Data_Plot(Data)
dType = dtypes_group(Data)

Yes/No Features

First, let's convert all Yes/No columns using as follows

$$\begin{cases} -1 & \mbox{Unknown}\\0 &\mbox{No}\\ 1 &\mbox{Yes}\end{cases}$$
In [6]:
df = Data.copy()
Categorical_Variables = dType.loc[dType['Data Type'] == 'object'].values[0,1]
YN_Feat = []
for c in Categorical_Variables:
    s = set(df[c].unique().tolist())
    if s.issubset({'No', 'Yes', 'Unknown'}):
        YN_Feat.append(c)
del c, s
List_Print('Yes/No Features', YN_Feat)

# Converting:
Temp = {'Yes':1, 'No':0, 'Unknown':-1}
for c in YN_Feat:
    df[c] = df[c].replace(Temp).astype(int)    
del c
display(df[YN_Feat].head().style.hide_index())

## Adding these keys and values to a dictionary
CatVar_dict = {}
for c in YN_Feat:
    CatVar_dict[c] = Temp
    
#substracting YN Features from Categorical_Variables
Categorical_Variables = list(set(Categorical_Variables) - set(YN_Feat))
del YN_Feat, Temp
Yes/No Features: Housing, Loan, Default, Term Deposit Subscription
Housing Loan Default Term Deposit Subscription
0 0 0 0
0 0 -1 0
1 0 0 0
0 0 0 0
0 1 0 0

Moreover,

In [7]:
List_Print('Remaining categorical features', Categorical_Variables)
Remaining categorical features: Day Of Week, Poutcome, Month, Job, Education, Contact, Marital

Poutcome

For these features, we have,

$$\mbox{Poutcome} = \begin{cases} -1 & \mbox{Nonexistent}\\0 &\mbox{Failure}\\ 1 &\mbox{Success}\end{cases}$$
In [8]:
Temp = {'Success':1, 'Failure':0, 'Nonexistent':-1}
df['Poutcome'] = df['Poutcome'].replace(Temp).astype(int)  
CatVar_dict['Poutcome'] = Temp
del Temp

Marital

$$\mbox{Poutcome} = \begin{cases} -1 & \mbox{Unknown}\\0 &\mbox{Single}\\ 1 &\mbox{Married}\\ 2 &\mbox{Divorced}\end{cases}$$
In [9]:
Temp = {'Divorced':2, 'Married':1, 'Single':0, 'Unknown':-1}
df['Marital'] = df['Marital'].replace(Temp).astype(int)  
CatVar_dict['Marital'] = Temp

Day Of Week

$$\mbox{Day Of Week} = \begin{cases} 0 & \mbox{Monday}\\ 1 &\mbox{Tuesday}\\ 2 &\mbox{Wednesday}\\ 3 &\mbox{Thursday}\\ 4 &\mbox{Friday}\\ 5 &\mbox{Saturday}\\ 6 &\mbox{Sunday} \end{cases}$$
In [10]:
Temp = [x for x in calendar.day_name]
Temp0 = {}
for x in np.arange(len(Temp)):
    Temp0[Temp[x]] = x
del Temp
df['Day Of Week'] = df['Day Of Week'].replace(Temp0).astype(int)  
CatVar_dict['Day Of Week'] = Temp0

Contact

$$\mbox{Contact} = \begin{cases} 0 & \mbox{Telephone}\\ 1 &\mbox{Cellular} \end{cases}$$
In [11]:
Temp = {'Telephone':0, 'Cellular':1}
df['Contact'] = df['Contact'].replace(Temp).astype(int)  
CatVar_dict['Contact'] = Temp

Job

In [12]:
Temp = {'Unknown':-1, 'Unemployed':0, 'Student': 1, 'Housemaid':2, 
        'Retired':3, 'Blue-Collar':4, 'Self-Employed': 5, 'Services':6,
        'Technician':7, 'Admin.':8, 'Management':9, 'Entrepreneur':10 }
df['Job'] = df['Job'].replace(Temp).astype(int)  
CatVar_dict['Job'] = Temp

Month

In [13]:
Temp = [x for x in calendar.month_name]
Temp = Temp[1:]
Temp0 = {}
for x in np.arange(len(Temp)):
    Temp0[Temp[x]] = x
del Temp

df['Month'] = df['Month'].replace(Temp0).astype(int)  
CatVar_dict['Month'] = Temp0

Education

In [14]:
Temp = {'Unknown':-1, 'Illiterate':0, 'Basic.4Y':1, 'Basic.6Y':2, 'Basic.9Y':3, 'High.School':4,
        'Professional.Course':5,  'University.Degree':6}
df['Education'] = df['Education'].replace(Temp).astype(int)  
CatVar_dict['Education'] = Temp

Categorical_Variables = CatVar_dict
del CatVar_dict

Pdays

In [15]:
df.loc[df['Pdays'] == 999, 'Pdays'] = -1

Age Group and Age Category

Creating new features:

  • Age Group
  • Age Category

We can create Age Categories using statcan.gc.ca.

Interval Age Category Age Category Code
00-14 years Children 0
15-24 years Youth 1
25-64 years Adults 2
65 years and over Seniors 3
In [16]:
bins = pd.IntervalIndex.from_tuples([(14, 24), (24, 64),(64, 100)])
Temp = pd.cut(df['Age'], bins)
df['Age'] = Temp.astype(str).replace({'(14, 24]':0, '(24, 64]':1,'(64, 100]':2})

Therefore,

In [17]:
display(df.head())
Features Age Job Marital Education Default Housing Loan Contact Month Day Of Week ... Campaign Pdays Previous Poutcome Employment Variation Rate Consumer Price Index Consumer Confidence Index Euribor three Month Rate Number of Employees Term Deposit Subscription
0 1 2 1 1 0 0 0 0 4 0 ... 1 -1 0 -1 1.1 93.994 -36.4 4.857 5191.0 0
1 1 6 1 4 -1 0 0 0 4 0 ... 1 -1 0 -1 1.1 93.994 -36.4 4.857 5191.0 0
2 1 6 1 4 0 1 0 0 4 0 ... 1 -1 0 -1 1.1 93.994 -36.4 4.857 5191.0 0
3 1 8 1 2 0 0 0 0 4 0 ... 1 -1 0 -1 1.1 93.994 -36.4 4.857 5191.0 0
4 1 6 1 4 0 0 1 0 4 0 ... 1 -1 0 -1 1.1 93.994 -36.4 4.857 5191.0 0

5 rows × 21 columns

In [18]:
Target = 'Term Deposit Subscription'
Labels = ['No', 'Yes']

Feat = 'Duration'
X = df[Feat].values.reshape(-1,1)
Test = np.arange(df[Feat].min(), df[Feat].max()).reshape(-1,1)
y = df[Target].values.reshape(-1,1)
logr = LogisticRegression(solver='newton-cg')
_ = logr.fit(X, y)
Pred_Prop = logr.predict_proba(Test)

fig, ax = plt.subplots(1, 1, figsize=(16,5))

# Right plot
_ = ax.scatter(X, y, color='HotPink', edgecolor = 'DeepPink')
_ = ax.plot(Test, Pred_Prop[:,1], color='MidnightBlue', lw = 1)
Temp = ax.get_xlim()
_ = ax.hlines(0, Temp[0], Temp[1], linestyles='dashed', lw=1)
_ = ax.hlines(1, Temp[0], Temp[1], linestyles='dashed', lw=1)
_ = ax.set_xlim(Temp)
_ = ax.set_xlabel('Last Contact Duration (in seconds)')
_ = ax.set_ylabel('Probability of %s' % Target)
_ = ax.set_title('Estimated Probability of %s using Logistic Regression' % Target )

del X, Test, y, logr, Pred_Prop, Temp
In [19]:
Feat = 'Pdays'
X = df[Feat].values.reshape(-1,1)
Test = np.arange(df[Feat].min(), df[Feat].max()).reshape(-1,1)
y = df[Target].values.reshape(-1,1)
logr = LogisticRegression(solver='newton-cg')
_ = logr.fit(X, y)
Pred_Prop = logr.predict_proba(Test)

fig, ax = plt.subplots(1, 1, figsize=(16,5))

# Right plot
_ = ax.scatter(X, y, color='Lime', edgecolor = 'LimeGreen', lw = 1)
_ = ax.plot(Test, Pred_Prop[:,1], color='MidnightBlue', lw = 1)
Temp = ax.get_xlim()
_ = ax.hlines(0, Temp[0], Temp[1], linestyles='dashed', lw=1)
_ = ax.hlines(1, Temp[0], Temp[1], linestyles='dashed', lw=1)
_ = ax.set_xlim(Temp)
_ = ax.set_xlabel('Days Passed Since the Previous Campaign Contact')
_ = ax.set_ylabel('Probability of %s' % Target)
_ = ax.set_title('Estimated Probability of %s using Logistic Regression' % Target )

del X, Test, y, logr, Pred_Prop, Temp

Data Correlations

Let's take a look at the variance of the features.

In [20]:
Fig, ax = plt.subplots(figsize=(17,16))
Temp = df.drop(columns = [Target]).var().sort_values(ascending = False).to_frame(name= 'Variance').round(2).T

_ = sns.heatmap(Temp, ax=ax, annot=True, square=True,  cmap =sns.color_palette("OrRd", 20),
                  linewidths = 0.8, vmin=0, vmax=Temp.max(axis =1)[0],
                  cbar_kws={'label': 'Feature Variance', "aspect":40, "shrink": .4, "orientation": "horizontal"})
labels = [x.replace(' ','\n').replace('Euribor\nthree','Euribor three').replace('\nof\n',' of\n')
          for x in [item.get_text() for item in ax.get_xticklabels()]]
_ = ax.set_xticklabels(labels)
_ = ax.set_yticklabels('')

Furthermore, we would like to standardize features by removing the mean and scaling to unit variance. In this article, we demonstrated the benefits of scaling data using StandardScaler().

In [21]:
# Scaling
Temp = df.drop(columns = Target).columns.tolist()
scaler = StandardScaler()
_ = scaler.fit(df[Temp])
df[Temp] = scaler.transform(df[Temp])

# Variance Plot
Fig, ax = plt.subplots(figsize=(17,16))

Temp = df.drop(columns = [Target]).var().sort_values(ascending = False).to_frame(name= 'Variance').round(2).T

_ = sns.heatmap(Temp, ax=ax, annot=True, square=True,  cmap =sns.color_palette('Greens'),
                  linewidths = 0.8, vmin=0, vmax=Temp.max(axis =1)[0],
                  cbar_kws={'label': 'Feature Variance', "aspect":40, "shrink": .4, "orientation": "horizontal"})
labels = [x.replace(' ','\n').replace('Euribor\nthree','Euribor three').replace('\nof\n',' of\n')
          for x in [item.get_text() for item in ax.get_xticklabels()]]
_ = ax.set_xticklabels(labels)
_ = ax.set_yticklabels('')
In [22]:
def Correlation_Plot (Df,Fig_Size):
    Correlation_Matrix = Df.corr().round(2)
    mask = np.zeros_like(Correlation_Matrix)
    mask[np.triu_indices_from(mask)] = True
    for i in range(len(mask)):
        mask[i,i]=0
    Fig, ax = plt.subplots(figsize=(Fig_Size,Fig_Size))
    sns.heatmap(Correlation_Matrix, ax=ax, mask=mask, annot=True, square=True, 
                cmap =sns.color_palette("Greens", n_colors=10), linewidths = 0.2, vmin=0, vmax=1, cbar_kws={"shrink": .6})
    
Correlation_Plot (df, Fig_Size = 14)
In [23]:
Fig, ax = plt.subplots(figsize=(17,16))
Temp = df.corr().round(2)
Temp = Temp.loc[(Temp.index == Target)].drop(columns = Target).T.sort_values(by = Target).T
_ = sns.heatmap(Temp, ax=ax, annot=True, square=True,  cmap =sns.color_palette("Greens", n_colors=10),
                linewidths = 0.8, vmin=0, vmax=1,
                annot_kws={"size": 12},
                cbar_kws={'label': Target + ' Correlation', "aspect":40, "shrink": .4, "orientation": "horizontal"})
labels = [x.replace(' ','\n').replace('Euribor\nthree','Euribor three').replace('\nof\n',' of\n')
          for x in [item.get_text() for item in ax.get_xticklabels()]]
_ = ax.set_xticklabels(labels)
_ = ax.set_yticklabels('')

Train and Test Sets

In [24]:
X = df.drop(columns = Target).values
y = df[Target].astype(float).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pd.DataFrame(data={'Set':['X_train','X_test','y_train','y_test'],
               'Shape':[X_train.shape, X_test.shape, y_train.shape, y_test.shape]}).set_index('Set').T
Out[24]:
Set X_train X_test y_train y_test
Shape (28831, 20) (12357, 20) (28831,) (12357,)

Modeling: PyTorch Logistic Regression with GPU

Next, We try the logistic regression iteratively using an optimization algorithm. The algorithm at each iteration uses the Cross-Entropy Loss to measure the loss, and then the gradient and the model update is calculated. At the end of this iterative process, we would reach a better level of agreement between test and predicted sets since the error would be lower from that of the first step.

Setting up Tensor Arrays

In [25]:
if torch.cuda.is_available():
    X_train_tensor = Variable(torch.from_numpy(X_train).cuda())
    y_train_tensor = Variable(torch.from_numpy(y_train).type(torch.LongTensor).cuda())
    X_test_tensor = Variable(torch.from_numpy(X_test).cuda())
    y_test_tensor = Variable(torch.from_numpy(y_test).type(torch.LongTensor).cuda())
else:
    X_train_tensor = Variable(torch.from_numpy(X_train))
    y_train_tensor = Variable(torch.from_numpy(y_train).type(torch.LongTensor))
    X_test_tensor = Variable(torch.from_numpy(X_test))
    y_test_tensor = Variable(torch.from_numpy(y_test).type(torch.LongTensor))
    
Batch_size = 100
iteration_number = 5e3

epochs_number = int(iteration_number / (len(X_train) / Batch_size))

# Pytorch train and test sets
Train_set = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
Test_set = torch.utils.data.TensorDataset(X_test_tensor, y_test_tensor)

# data loader
train_loader = torch.utils.data.DataLoader(Train_set, batch_size = Batch_size, shuffle = False)
test_loader = torch.utils.data.DataLoader(Train_set, batch_size = Batch_size, shuffle = False)

Modeling

In [26]:
class LogisticRegressionModel(torch.nn.Module):
    def __init__(self, input_Size, output_Size):
        super(LogisticRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(input_Size, output_Size)
    
    def forward(self, x):
        out = self.linear(x)
        return out

input_Size, output_Size = len(X[0]), len(np.unique(y))

# model
model = LogisticRegressionModel(input_Size, output_Size)

# GPU
if torch.cuda.is_available():
    model.cuda()

# Cross Entropy Loss 
CEL= nn.CrossEntropyLoss()

# Optimizer 
learning_rate = 1e-2
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Traning the Model
Count = 0
Loss_list = []
Iteration_list = []
Accuracy_list = []
MSE_list = []
MAE_list = []
Steps = 10

Progress_Bar = progressbar.ProgressBar(maxval= iteration_number + 200,
                                       widgets=[progressbar.Bar('=', '|', '|'),
                                                progressbar.Percentage()])
# print('---------------------------------------------------------')
for epoch in range(epochs_number):
    for i, (Xtr, ytr) in enumerate(train_loader):
        
        # Variables
        Xtr = Variable(Xtr.view(-1, X[0].shape[0]))
        ytr = Variable(ytr)
        
        # Set all gradients to zero
        optimizer.zero_grad()
        
        # Forward
        Out = model(Xtr.float())
        
        # loss
        loss = CEL(Out, ytr.long())
        
        # Backward (Calculating the gradients)
        loss.backward()
        
        # Update parameters
        optimizer.step()
        
        Count += 1
        
        del Xtr, ytr
        
        # Predictions
        if Count % Steps == 0:
            # Calculate Accuracy         
            Correct, Total = 0, 0
            # Predictions
            for Xts, yts in test_loader: 
                Xts = Variable(Xts.view(-1, X[0].shape[0]))
                
                # Forward
                Out = model(Xts.float())
                
                # The maximum value of Out
                Predicted = torch.max(Out.data, 1)[1]
                
                # Total number of yts
                Total += len(yts)
                
                # Total Correct predictions
                Correct += (Predicted == yts).sum()
            del Xts, yts
            # storing loss and iteration
            Loss_list.append(loss.data)
            Iteration_list.append(Count)
            Accuracy_list.append(Correct / float(Total))
            
        Progress_Bar.update(Count)

Progress_Bar.finish()

history = pd.DataFrame({'Iteration': np.array(Iteration_list),
                      'Loss': np.array([x.cpu().data.numpy() for x in Loss_list]),
                      'Accuracy': np.array([x.cpu().data.numpy() for x in Accuracy_list])})
|=========================================================================|100%

Model Optimization Plot

In [27]:
def Plot_history(history, Table_Rows = 25, yLim = 2):
    fig = make_subplots(rows=1, cols=2, horizontal_spacing = 0.02, column_widths=[0.6, 0.4],
                        specs=[[{"type": "scatter"},{"type": "table"}]])
    # Left
    fig.add_trace(go.Scatter(x= history['Iteration'].values, y= history['Loss'].astype(float).values.round(4),
                             line=dict(color='OrangeRed', width= 1.5), name = 'Loss'), 1, 1)
    fig.add_trace(go.Scatter(x= history['Iteration'].values, y= history['Accuracy'].astype(float).values,
                             line=dict(color='MidnightBlue', width= 1.5),  name = 'Accuracy'), 1, 1)
    fig.update_layout(legend=dict(x=0, y=1.1, traceorder='reversed', font_size=12),
                  dragmode='select', plot_bgcolor= 'white', height=600, hovermode='closest',
                  legend_orientation='h')
    fig.update_xaxes(range=[history.Iteration.min(), history.Iteration.max()],
                     showgrid=True, gridwidth=1, gridcolor='Lightgray',
                     showline=True, linewidth=1, linecolor='Lightgray', mirror=True, row=1, col=1)
    fig.update_yaxes(range=[0, yLim], showgrid=True, gridwidth=1, gridcolor='Lightgray',
                     showline=True, linewidth=1, linecolor='Lightgray', mirror=True, row=1, col=1)
    # Right
    ind = np.linspace(0, history.shape[0], Table_Rows, endpoint = False).round(0).astype(int)
    ind = np.append(ind, history.Iteration.values[-1])
    history = history[history.index.isin(ind)]
    fig.add_trace(go.Table(header=dict(values = list(history.columns), line_color='darkslategray',
                                       fill_color='DimGray', align=['center','center'],
                                       font=dict(color='white', size=12), height=25), columnwidth = [0.4, 0.4, 0.4, 0.4],
                           cells=dict(values=[history.Iteration, history.Loss.astype(float).round(4).values,
                                          history.Accuracy.astype(float).round(4).values],
                                      line_color='darkslategray', fill=dict(color=['WhiteSmoke', 'white']),
                                      align=['center', 'center'], font_size=12,height=20)), 1, 2)
    fig.show()
In [28]:
Plot_history(history, Table_Rows = 18, yLim = 1)

Confusion Matrix

The confusion matrix allows for visualization of the performance of an algorithm.

In [29]:
def Confusion_Matrix(Model, FG = (12, 4), X_train_tensor = X_train_tensor, y_train = y_train,
                     X_test_tensor = X_test_tensor, y_test = y_test):
    
    font = FontProperties()
    font.set_weight('bold')
    ############# Train Set #############
    fig, ax = plt.subplots(1, 2, figsize=FG)
    _ = fig.suptitle('Train Set', fontproperties=font, fontsize = 16)
    
    # Predictions
    y_pred = model(X_train_tensor.float())
    y_pred = torch.max(y_pred.data, 1)[1]
    y_pred = y_pred.cpu().data.numpy()
    
    # confusion matrix
    CM = metrics.confusion_matrix(y_train, y_pred)
    _ = sns.heatmap(CM.round(2), annot=True, annot_kws={"size": 14}, cmap="Blues", ax = ax[0])
    _ = ax[0].set_title('Confusion Matrix')
    CM = CM.astype('float') / CM.sum(axis=1)[:, np.newaxis]
    _ = sns.heatmap(CM.round(2), annot=True, annot_kws={"size": 14}, cmap="Greens", ax = ax[1],
                   linewidths = 0.2, vmin=0, vmax=1, cbar_kws={"shrink": 1})
    _ = ax[1].set_title('Normalized Confusion Matrix')
    
    for a in ax:
        _ = a.set_xlabel('Predicted labels')
        _ = a.set_ylabel('True labels')
        _ = a.xaxis.set_ticklabels(Labels)
        _ = a.yaxis.set_ticklabels(Labels)
        
    ############# Test Set #############
    fig, ax = plt.subplots(1, 2, figsize=FG)
    _ = fig.suptitle('Test Set', fontproperties=font, fontsize = 16)
    font = FontProperties()
    font.set_weight('bold')
    
    # Predictions
    y_pred = model(X_test_tensor.float())
    y_pred = torch.max(y_pred.data, 1)[1]
    y_pred = y_pred.cpu().data.numpy()
    
    # confusion matrix
    CM = metrics.confusion_matrix(y_test, y_pred)
    _ = sns.heatmap(CM.round(2), annot=True, annot_kws={"size": 14}, cmap="Blues", ax = ax[0])
    _ = ax[0].set_title('Confusion Matrix')
    CM = CM.astype('float') / CM.sum(axis=1)[:, np.newaxis]
    _ = sns.heatmap(CM.round(2), annot=True, annot_kws={"size": 14}, cmap="Greens", ax = ax[1],
                   linewidths = 0.2, vmin=0, vmax=1, cbar_kws={"shrink": 1})
    _ = ax[1].set_title('Normalized Confusion Matrix')
    
    for a in ax:
        _ = a.set_xlabel('Predicted labels')
        _ = a.set_ylabel('True labels')
        _ = a.xaxis.set_ticklabels(Labels)
        _ = a.yaxis.set_ticklabels(Labels)
In [30]:
Confusion_Matrix(model)

Predictions

In [31]:
Sample = df.sample(frac = 0.1)
X_sample = Sample.drop(columns = [Target]).values
X_sample = scaler.transform(X_sample)

if torch.cuda.is_available():
    X_sample_tensor = Variable(torch.from_numpy(X_sample).cuda())
else:
    X_sample_tensor = Variable(torch.from_numpy(X_sample))

Labels = ['No', 'Yes']
y_pred = model(X_sample_tensor.float())
y_pred = np.asarray(y_pred.cpu().detach().numpy())
y_pred = pd.Series(y_pred.argmax(axis=1)).to_frame('Term Deposit Subscription (Predicted)').applymap(lambda x: Labels[0] if x ==0 else  Labels[1])
Predictions = pd.concat([Data, y_pred], axis = 1).dropna(subset = ['Term Deposit Subscription (Predicted)'])
display(Predictions)
Age Job Marital Education Default Housing Loan Contact Month Day Of Week ... Pdays Previous Poutcome Employment Variation Rate Consumer Price Index Consumer Confidence Index Euribor three Month Rate Number of Employees Term Deposit Subscription Term Deposit Subscription (Predicted)
0 56 Housemaid Married Basic.4Y No No No Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.857 5191.0 No Yes
1 57 Services Married High.School Unknown No No Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.857 5191.0 No Yes
2 37 Services Married High.School No Yes No Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.857 5191.0 No Yes
3 40 Admin. Married Basic.6Y No No No Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.857 5191.0 No Yes
4 56 Services Married High.School No No Yes Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.857 5191.0 No Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4114 52 Entrepreneur Married University.Degree No No No Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.858 5191.0 No Yes
4115 55 Services Divorced High.School No No Yes Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.858 5191.0 No Yes
4116 24 Services Single High.School No No No Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.858 5191.0 No Yes
4117 46 Admin. Divorced High.School Unknown No Yes Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.858 5191.0 No Yes
4118 31 Admin. Divorced University.Degree No No No Telephone May Monday ... 999 0 Nonexistent 1.1 93.994 -36.4 4.858 5191.0 No Yes

4119 rows × 22 columns


Refrences

  • S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

  • S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimaraes, Portugal, October, 2011. EUROSIS. [bank.zip]